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ABSTRACT 

V 

The  Massively  Parallel  Processor  (MPP)  is  a  128  by  128 
array  of  processing  elements  that  communicate  with  their 
horizontal  and  vertical  neighbors  by  shifting  data  one  bit 
at  a  time.  This  paper  describes  the  efficient  use  of  MPP  for 
various  types  of  image  processing  operations,  including  point 
and  local  operations,  discrete  transforms,  and  computation  of 
image  statistics.  A  comparison  between  MPP  and  ZMOB  (a  system 
consisting  of  256  microprocessors)  is  also  presented. 
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1.  Introduction 


1.1  MPP 

The  Massively  Parallel  Processor  (MPP)  is  a  128  by  128 
array  of  processing  elements  (PEs)  that  communicate  with  their 
horizontal  and  vertical  neighbors  by  shifting  data  one  bit 
at  a  time.  For  a  description  of  the  MPP  design  see  [1] .  In 
the  following  paragraphs  we  outline  only  a  few  basic  features 
of  MPP  that  are  needed  in  designing  image  processing  algorithms 
for  it. 

Each  image  processing  algorithm  implemented  on  MPP 
will  consist  of  two  phases:  computation  and  communication. 

To  support  the  computational  aspect  of  parallel  algorithms, 
each  PE,  while  being  a  "bit-slice"  processor,  is  capable  of 
supporting  a  complete  conventional  instruction  set.  Each  PE 
has  a  bit  addressable  local  memory  of  1024  bits  and  a  number 
of  fast  registers  to  support  arithmetic  and  interprocessor 
communication . 

Parallel  algorithms  generally  require  interprocessor 
communication:  to  accomplish  this,  every  PE  can  synchronously 

shift  data  to  its  north,  south,  east,  or  west  neighbor. 

(At  the  array  edges,  processor  passing  may  "wrap  around"  to 
the  PEs  at  the  other  end  of  the  row  or  column.)  When  loading 
data  from  the  host  machine,  a  128-long  bit  vector  may  be  passed 
to  the  128  edge  processors  all  at  once,  which  may  in  turn 
shift  it  across  the  image  while  the  rest  of  the  image  is  loaded. 
In  the  current  configuration  this  data  loading  occurs  over  a 
UNIBUS  from  a  VAX  host. 


This  paper  deals  with  the  efficient  use of  MPP  for 
performing  various  types  of  image  processing  operations, 
including  point  and  local  operations,  discrete  transforms, 
and  computation  of  image  statistics.  The  aim  is  to  make  the 
fullest  possible  use  of  MPP's  parallelism,  so  as  to  achieve 

2 

a  speedup  by  a  factor  proportional  to  the  number  of  PEs  (128 
16,384).  We  also  compare  MPP  processing  with  performing 
operations  on  the  host  VAX  itself,  as  well  as  with  processing 
on  ZMOB  (a  system  consisting  of  256  microprocessors  that 
communicate  via  a  fast  shift-register  bus) .  A  more  detailed 
treatment  of  image  processing  on  ZMOB  can  be  found  in  [ 2 ] . 


2 .  Point  Operations 


A  point  operation  on  an  image  maps  the  value  of  each 
pixel  into  a  new  value,  independent  of  the  values  of  other 
pixels.  The  image  is  divided  equally  among  the  PEs;  1  pixel/ 
processor  for  a  128  by  128  image,  4  pixels/processor  for  a 
256  by  256  image,  16  pixels/processor  for  a  512  by  512  image, 
and  so  on.  Images  much  larger  than  512  by  512  cannot  be  held 
in  the  1024  bits  of  local  memory  available  to  each  PE.  The 
PEs  are  loaded  with  the  image  data  from  the  host  VAX  over  the 
UNIBUS,  the  point  operation  is  performed,  and  the  results  are 
returned  to  the  host  VAX. 

To  compute  the  amounts  of  time  needed  to  perform  point 

operations  on  MPP  and  on  the  VAX,  let  and  be  the  times 

for  an  MPP  PE  and  for  the  VAX,  respectively,  to  perform  the 

given  operation  on  one  pixel.  In  an  N  by  N  image,  there  are 
2  .  2  2 

N  pixels;  thus,  C^N  and  C^N  /16,384  are  the  times  to  perform 
the  point  operation  on  the  VAX  and  MPP  (with  its  16,384  pro¬ 
cessors)  ,  respectively. 

However,  in  the  case  of  the  MPP,  there  is  also  the  amount 
of  image  loading  and  unloading  time  to  consider.  On  the  MPP, 
data  is  loaded  from  the  host  VAX,  via  the  UNIBUS,  to  a  staging 
area  of  the  MPP,  where  the  data  is  input  simultaneously  to  128 
edge  PEs,  128  bits  at  a  time.  Letting  r  be  the  rate  at  which 
a  byte  of  data  is  transferred  on  the  UNIBUS  (400nsec.,  and  p  be 
the  rate  at  which  a  bit  of  data  is  passed  between  PEs,  let  us 


compute  how  long  it  takes  to  load  a  128  by  128  (say)  image 
of  byte-long  pixels:  1)  from  the  VAX  to  MPP  staging  area 
via  the  UNIBUS,  and  2)  from  the  MPP  staging  area  to  the  PEs 
(a  concurrent  process).  Via  the  UNIBUS  it  takes  128  x  128  x  r, 
or  6.534insec.  From  the  staging  area  to  the  PEs,  it  takes 
128  X  128  X  8  bits  x  1/128  (number  of  bits  passed  simultaneously) 
X  p,  or  1.024]jsec.  Thus,  the  UNIBUS  is  the  rate-limiting 
step  of  the  MPP  image  loading  process,  and  the  total  time  to 
load  nnd  unload  is  rN^  +  rN^  =  2rN^. 

In  summary,  on  the  VAX,  the  time  to  perform  the  operation 
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on  the  entire  image  is  C^N  ,  while  the  time  to  perform  it  on 
the  MPP  is  2rN^  +  C^^N^/ie , 384 .  If  32,768r  +  <  16,384C^, 

using  the  MPP  is  faster  than  using  the  VAX. 

With  local  operations,  the  situation  is  more  complicated 
because  information  must  be  shared  between  neighboring  processors 
The  next  section  will  discuss  the  amount  of  time  it  takes  to 
perform  local  operations,  using  different  neighborhood  geometries 
A  comparison  with  performing  an  (iterated)  operation  on  the 
host  VAX  will  also  be  given.  Due  to  the  limited  local  memory 
of  MPP  PEs,  the  focus  of  the  discussion  will  be  the  one  pixel 
per  PE  case. 


3 .  Local  operations 

Each  iteration  of  a  local  operation  consists  of  two 
steps:  a  neighbor-passing  step,  and  a  computation  step 

involving  the  gathered  neighborhood.  Several  types  of  local 
neighborhoods  are  commonly  used,  and  these  (with  the  steps 
involved  in  passing  neighbors)  are  outlined  in  Figure  1. 

Every  passing  sequence  involves  the  exact  number  of  neighbors 
required,  except  for  the  8-neighbor  connected  component  case, 
where  one  extra  neighbor  transfer  occurs  (due  to  the  inter¬ 
connection  structure  of  MPP) .  In  all,  eight  pixels  are 
passed  in  the  8-neighbor  case;  four  pixels  in  the  4-neighbor 
case;  three  pixels  in  the  2x2  case;  five  pixels  in  the  8- 
neighbor  connected  component  case;  and  two  pixels  in  the 
4-neighbor  connected  component  case.  In  the  following  para¬ 
graphs  we  analyze  a  specific  case,  the  8-neighbor  local  operation, 
and  give  a  comparison  between  the  performance  of  MPP  and  of 
the  host  VAX  itself. 

When  is  using  MPP  better  than  simply  using  the  host  VAX? 

In  other  words,  when  does  the  overhead  of  using  MPP  (loading 
and  unloading  an  image  via  the  UNIBUS)  offset  the  time  saved 
in  performing  an  (iterated)  local  operation?  To  answer  this, 
we  must  first  obtain  formulas  for  computation  times  on  VAX 
and  MPP . 

We  will  assume  a  128  x  128  image,  thus  one  pixel  per  MPP 


PE.  The  relevant  parameters  are: 


N  =  length  of  image  side  =  128 

p  =  time  to  pass  one  bit  between  MPP  PEs 

m  =  number  of  bits  per  pixel  (8,  for  256  grey  levels) 


=  time  to  compute  one  local  operation  on  MPP 

=  time  to  compute  one  local  operation  on  VAX 

n  =  number  of  iterations  of  the  local  operation 

r  =  time  to  pass  one  pixel  over  the  UNIBUS 

On  the  VAX,  the  time  to  compute  n  iterations  of  a  local 

operation  taking  time  per  pixel  is 

2 

T  =  nC  N 
^VAX  "  V 

On  MPP ,  the  computation  must  be  split  into  three 

states;  Loading  (L^)  »  processing  (Pj^)  »  and  unloading  (Uj^)  . 

As  we  have  already  seen,  the  loading  of  the  MPP  PEs  is  limited 

by  the  amount  of  time  it  takes  to  transfer  the  image  pixels 

over  the  UNIBUS  (loading  of  the  PE.  from  that  point  is  much 

faster) .  Loading  and  unloading  times  are  the  same; 

L  =  U  =  rN^ 
m  m 

There  are  two  stages  for  each  iteration  of  a  local  operation 

on  MPP;  communication  and  computation.  For  an  eight-neighbor 

operation  with  one  pixel/PE,  the  pass  time  is  8mp  per  iteration, 

and  the  compute  time  is  per  iteration.  Thus, 

p  =  8nmp+  nC 
m  m 


In  summary,  the  total  time  for  MPP  processing  is  T, 


MPP  ^m  ^m  ^m. 


or 


MPP 


=  2rN  +  8nmp  +nC 


m 


s 


Given  that  the  VAX  takes  some  fraction  a  of  the  time 

that  an  MPP  PE  does  for  the  given  local  operation  (a  will 

vary) ,  how  time-consuming  must  that  local  operation  be  (on 

MPP,  say)  before  it  is  worth  moving  to  MPP  for  processing? 

Let  C  =  aC  ,  and  solve: 

V  m 

m  _  ip 

VAX  MPP 

anC  =  2rN^  +  8nmp  +nC 

m  2  ^ 

Q  =  2rN  +  8nmp 

anN^  -  n 

Tables  1  and  2  show  typical  results  for  the  realistic  values 
N  =  128 
m  =  8 

-7 

p  =  3x10  sec.  (300nsec/bit  PE  transfer  rate) 

-7 

r  =  4x10  sec.  (400nsec/byte  UNIBUS  transfer  rate) 
Table  1  gives  minimum  MPP  computation  times  for  =  "^mpp' 

Table  2  gives  minimum  times  for  T„,„  =  10T„„„ 

We  can  see  from  these  tables  that  MPP  will  usually  be 
advantageous  over,  and  often  more  than  ten  times  faster  than, 
the  VAX,  since  one  to  ten  microseconds  is  the  minimum  for  MPP 


PE  operations.  For  short  once-iterated  operations,  MPP  will 

-7  -3 

be  lO-bound;  for  C  between  10  and  10  sec.,  the  fractional 

m 

overhead  in  transferring  the  image  between  the  VAX  and  MPP  is 

-2  -1 
over  90%;  at  C  =10  sec.,  the  overhead  is  57%;  at  C  =10  sec 
m  m 

the  overhead  is  12%;  and,  at  higher  C  values  or  for  more  than 

m 

one  iteration,  the  overhead  drops  well  below  1%.  Generally, 
more  than  one  iteration  of  a  local  operation  must  be  performed 
before  MPP  is  useful. 


In  the  case  where  we  have  several  pixels  per  PE 

(N  by  N  image,  N  >  128),  the  situation  is  different.  For 

local  operations  on  images  larger  than  128  by  128,  the  general 

2 

formula  for  the  computation  time  is  /P  and  for  the  communica¬ 

tion  time  is  (4 (N/v^)  +  4)  {the  number  of  points  bordering 
2 

the  size  N  /P  subregion)  times  mp.  Thus,  with  increasing 
N  (within  the  constraint  of  the  limited  PE  local  memory) , 
the  computation  time  rises  by  the  square  and  the  communication 
time  rises  linearly  with  N;  consequently,  the  calculation  becomes 
more  CPU  bound.  In  any  case,  the  small  amount  of  memory  per 
PE  limits  the  number  of  pixels  that  can  be  handled  by  a  PE. 

The  values  of  a  pixel  and  its  eight  neighbors  already  take  up  a 
significant  fraction  of  this  memory  (72  bits,  or  about  7%). 

To  handle  a  2x2  block  of  pixels  and  their  neighbors  (a  4x4 
block  in  all)  requires  nearly  twice  this,  and  a  3x3  block 
with  neighbors  (5x5  in  all)  requires  40%  of  the  memory.  It 
would  be  difficult  to  handle  much  larger  blocks. 


lii  this  section  we  consider  some  MPP  tasks  involving 
computation  of  image  statistics  -  in  particular,  the  computa¬ 


tion  of  image  histograms  and  co-occurrenoe  matrices  on  MPP. 

4 • 1  Histograms 

The  histogram  algorithm  for  MPP  consists  of  two  main 
steps:  histogramming  the  columns  of  the  image  (creating  a 

histogram  for  the  pixels  in  each  column  with  the  "buckets" 
for  each  gray-level  residing  along  with  the  pixels  in  the  PEs 
of  each  row),  and  totalling  the  row  so  that  the  (e.g.)  left¬ 
most  column  of  PEs  contains  the  final  histogram  for  the  image. 
For  simplicity,  the  method  described  below  is  designed  for 
one  pixel  and  one  histogram  bucket  per  PE — a  128  x  128  image, 
and  128  (i.e.,  seven-bit)  gray  levels. 

a)  Histogramming  columns 

The  method  for  histogramming  the  columns  of  the  image 
involves  passing  the  gray-levels  cyclically  (and  synchronously) 
around  the  PEs  of  that  column,  using  the  "wraparound"  feature 
of  the  MPP  when  passing  pixels  between  processors.  The  goal  is 
to  have  the  processor  in  row  i  of  the  given  column  contain  a 
count  of  the  number  of  occurrences  of  gray  level  _i  in 

that  column.  In  this  example,  each  PE  sets  aside  an  eight-bit 
counter  for  the  histogram  "bucket"  and  cycles  the  seven-bit 
gray-levels  through  each  of  the  128  PEs  in  the  column.  Whenever 
a  gray-level  corresponding  to  the  row  number  of  the  PE  passes 


!  ^ 


through,  the  counter  in  that  PE  is  incremented  by  1.  This 
method  is  extensible  to  more  than  128  gray  levels;  the 
processors  simply  multiply  their  responsibility  for  gray 
levels  (e.g.,  two  each  for  256  or  four  each  for  512  gray 
levels);  this  is  similarly  true  for  larger  images.  Letting 
N  =  the  number  of  processors  in  the  column  (128)  and  m  = 
the  number  of  gray  levels  (128,  in  this  example) ,  the  com¬ 
plexity  of  this  part  of  the  algorithm  is  6 (nlogm) .  See  Figure 
2  for  an  example  of  an  eight-long  column  (and  eight  gray  levels) . 

b)  Totalling  rows 

Totalling  the  rows  to  obtain  the  final  histogram  is  done 
in  a  somewhat  more  complicated  fashion.  The  method  is  to 
pass  the  counters  derived  from  the  column  histogramming  step 
leftward  and  sum  them  at  each  level.  This  summing  may  be 
done  bit-by-bit  (by  adding  two  bits  and  saving  the  carry  for 
the  next  round) ,  since  they  must  be  passed  bitwise  anyway,  to 
save  time.  The  least  significant  bit  (LSB)  is  passed  leftward 
first,  and  this  is  added  to  the  LSB  of  the  held  counter  (with 
the  carry  saved  in  a  special  register) ;  the  LSB  of  the  resulting 
number  is  passed  at  the  next  step.  This  continues  until  the 
final  LSB  propagates  to  the  leftmost  column,  where  it  is  added 
to  that  column's  counter  and  results  in  the  LSB  of  the  final 
bucket  count.  Meanwhile,  the  next-to-last  bit  propagates 
leftward  after  the  LSB,  being  added  to  the  next-to-last  bit, 


and  the  carry  from  the  LSB  addition,  in  the  same  fashion, 
until  it  propagates  to  the  left  column. 

Since  larger  and  larger  counts  are  being  formed  as  the 
column  totals  merge,  the  counters  of  each  column  must  be 
extended  to  accommodate  these  sums.  For  column  N  (numbering 
from  1  at  the  right  to  128  at  the  left),  that  column's  counter 
must  be  extended  todlog^  Nj  +  8)  bits.  So  that  the  algorithm 
may  work  in  proper  synchrony,  every  bit  of  each  counter  must 
be  passed  upward,  even  leading  zeros.  Figure  3  presents  a 
worked  out  example  for  a  row  of  length  6 . 

Letting  N  =  the  number  of  processors  in  a  row  of  the 
processor  array  (128  on  MPP) ,  it  takes  N  steps  to  propagate 
the  LSB  to  the  left  column.  It  then  takes  (21og2N  -  1)  steps 
to  pass  the  rest  of  the  (21og2N)-bit  counter  maintained  by 
the  PE  in  the  second-to-left  column.  Thus,  this  part  of  the 
algorithm  takes  9 (N  +  log2N)  steps. 

The  total  complexity  of  histogramming  on  the  MPP  is  6 (nlog2m) 
(m  the  number  of  gray  levels)  from  the  first  part  plus  6  (N  +  log^N) 
in  the  second  part,  which  totals  to  6  (Nlogm)  . 

c)  Time  requirements 

The  first  step,  column  histogramming,  involves  cycling  N 
m-bit  pixels  through  the  column  PEs,  comparing  the  pixel  value 
to  the  row  number  and  (potentially)  incrementing  a  counter 
at  each  step  (note  that  on  an  SIMD  machine  such  as  MMP,  a  step 


such  as  this  incrementing  takes  just  as  much  time  whether  it 
occurs  or  not,  since  the  instruction (s)  must  be  sent  to 
each  processor  anyway;  they  are  simply  disarmed  if  necessary) . 
Thus,  at  each  of  N  steps,  an  m^bit  pass,  an  m-bit  compare,  and 
an  (n  +l)-bit  add  occur;  thus,  the  time  taken  for  column 
histogramming  is: 


Here 


=  N(mp  +  me  +  (n  +  l)a) 


N  =  length  of  image  side  =  128 
n  =  log2N  =  7 

m  =  number  of  bits  per  gray  level  =  7  (128  gray  levels) 

p  =  time  to  pass  one  bit  between  MMP  PEs  (lOOnsec.) 

a  =  time,  per  bit,  to  add  two  numbers  on  MPP  (lOOnsec.) 

c  =  time,  per  bit,  to  compare  two  numbers  on  MPP 
(400nsec. ) 

r  =  time  to  pass  one  pixel  over  the  UNIBUS  (400nsec.) 
The  MPP  instruction  timing  will  vary,  depending  on  the  exact 
programming  of  the  algorithm. 

For  the  second  step,  row  totalling,  there  are  (N  +  2n  -  1) 
steps  where  one  bit  is  passed  and  one  addition  takes  place; 
thus,  the  time  taken  for  row  totalling  is: 

Trow  =  (N  +  2n  -  1) (p  +  a) 

To  this  is  added  the  time  to  load  and  unload  the  image,  which 
is : 

2 

T  =  T  =  rN 

noad  unload  ^ 

The  total  time  for  histogramming  a  128  by  128  image  (128  gray 
levels)  on  MPP  is  thus 


4 . 2  Co-occurrence  matrices 

A  co-occurrence  matrix  is  essentially  a  "histogram"  of 
the  occurrences  of  pairs  of  gray  levels;  if  there  are  M  different 
gray  levels,  it  is  an  M  by  M  matrix.  To  compute  the  co-occurrence 
matrix  of  an  image,  the  neighbor  of  each  point  at  some  dis¬ 
placement  6  is  obtained,  and  the  appropriate  entry  (gray-level^^ , 
gray-level2)  of  the  matrix  incremented  by  one.  On  the  MPP, 
this  would  be  analogous  to  the  histogram  algorithm  presented 
earlier:  the  M  by  M  matrix  would  be  treated  as  a  size  M  histo¬ 

gram;  each  processor  would  be  responsible  for  M/128  rows  of 
the  matrix;  the  points  are  circulated  around  the  columns,  each 


PE  updating  appropriate  entries  of  its  rows;  finally,  these 
columns  are  passed  leftward  and  totalled. 

However,  since  there  are  only  1024  bits  (128  bytes) 
available  in  the  local  memory  of  each  PE,  the  largest  number 
of  values  which  can  be  accommodated  is  128  (with  no  room  to 
space)  or,  practically,  64.  Thus  co-occurrence  matrix  computa¬ 
tion  on  MPP  should  be  done  for  matrices  of  small  size,  e.g., 

8  by  8. 


5 .  Two-dimensional  discrete  transforms 

On  MPP,  the  following  method  calculates  the  two-dimensional 
Fourier  transform  (or  other  similar  discrete  transform)  of 
an  N  by  N  image  in  0 (N)  time.  The  process  is  composed  of  two 
steps:  the  discrete  transform  of  the  image  row-wise,  then  the 

discrete  transform  column-wise.  To  transform  the  rows,  each 
processor  computes  the  first  complex  term  it  will  use  in  its 
summation,  multiplies  it  by  the  pixel  value,  and  stores  the 
result  in  a  register.  Then  each  pixel  is  shifted  circularly, 
the  second  term  is  calculated,  multiplied,  added  to  the 
counter,  and  so  on.  This  process  is  repeated  similarly  for 
the  columns.  Each  takes  N  steps,  thus  the  algorithm  takes 
0(N)  time.  However,  while  this  method  does  well  on  128  by  128 
images  (one  pixel  per  PE) ,  the  processors  quickly  run  out  of 
local  memory  with  larger  images. 

MPP  is  also  very  limited  in  its  ability  to  perform  geo¬ 
metric  operations  on  images,  primarily  due  to  memory  con¬ 
straints.  Due  to  the  fixed  geometry  of  the  processo..s  and 
the  synchronous  nature  of  their  intercommunication,  unless 
each  processor  can  hold  the  block  of  data  it  needs  to  calculate 
the  values  of  the  output  pixels,  there  is  no  "smooth"  way  of 
getting  the  needed  data  to  its  destination  in  a  parallel  fashion. 


6.  Comparison  of  MPP  and  ZMOB 


Tables  3  and  4  show  the  performance  of  MPP  and  ZMOB, 
respectively,  at  various  basic  image  processing  tasks.  The 
MPP  table  uses  bits  as  the  basic  image  units,  whereas  the  ZMOB 
table  uses  pixels.  These  tables  include  total  complexity 
measures  for  computation  time,  communication  time,  and  memory 
requirements  as  a  function  of  image  size  (N,  the  diameter) , 
number  of  processors  (P) ,  the  number  of  gray  levels  (M) ,  and 
various  constants.  Tables  5  and  6  restate  this  information 
for  the  histogramming  algorithm,  based  on  the  relations  of 
P  and  M  to  N.  Note  that  a  factor  of  e(N^),  due  to  the  UNIBUS 
image  loading  and  unloading  step,  appears  in  each  communication 
complexity  formula,  separated  by  parentheses  from  the  inter¬ 
processor  communication  complexity. 

If  the  number  of  processors  in  ZMOB  is  regarded  as  pro¬ 
portional  to  the  image  diameter  (N) ,  and  the  number  of  pro- 

2 

cessors  in  MPP  as  proportional  to  image  size  (N  ) ,  then  we 
see  in  Tables  3  and  4  how  computational  complexity  decreases, 
but  intercommunication  complexity  increases,  when  the  relative 
number  of  processors  assigned  to  a  task  increases.  A  comparison 
of  the  actual  timings  of  a  histogram  algorithm,  in  Tables  1 
and  2,  and  Tables  3  and  4  in  [2],  show  that  in  reality,  the 
machines  are  quite  close  in  their  utility  relative  to  the  VAX. 


} 


7 .  Concluding  remarks 

Due  to  the  inflexible  intercommunication  structure  in 
MPP,  certain  algorithms  are  constrained  to  have  a  value  or 
values  propagate  from  one  end  of  the  array  to  the  other,  and 
thus  have  an  unavoidable  factor  of  N,  or  8N  for  one-byte 
data,  built  into  their  complexity.  In  addition,  other 
algorithms,  where  communication  does  not  occur  in  a  tightly 
orchestrated  way,  become  intractable.  The  severely  limited 
local  memory  space  is  also  a  difficulty  in  considering  certain 
algorithms  or  certain  (practical)  image  sizes.  Nevertheless, 
MPP  still  manifests  significant  speed  advantages,  particularly 
when  it  is  used  for  point  and  local  space-domain  operations 
or  for  transform-domain  filtering.  It  will  be  a  powerful  tool 
for  image  processing  and  analysis. 


Image  reconstruction  on  MPP  and  ZMOB 

The  two  methods  of  image  reconstruction  which  will  be 
discussed  for  implementation  on  MPP  and  ZMOB  are  the  Filtered 
Back  Projection  and  Fourier  reconstruction  methods.  The  former 
basically  involves  taking  each  point  of  a  density  projection 
and  "smearing"  its  value,  divided  by  an  appropriate  measure 
of  width,  across  the  image.  This  is  repeated  for  each  pro¬ 
jection,  its  points  being  smeared  additively,  with  suitable 
(pre-and)  post-processing  of  the  image  to  compensate  for  the 
spread  function  of  the  back  projection  process.  The  latter 
method  involves  taking  the  Fourier  transform  of  each  projection 
and,  by  applying  the  Fourier  Slice  Theorem  (which  states  that 
the  transforms  of  the  projections  are  the  values  of  the 
central  cross  sections,  at  the  same  orientations,  of  the 
transformed  image) ,  using  them  as  values  from  which  to  interpolate 
the  Cartesian-grid  representation  of  the  transformed  image, 
from  which  the  reconstructed  image  is  derived  by  inverse 
transformation . 

On  the  MPP,  the  first  method,  filtered  back-projection, 
is  difficult  due  to  the  non-linear  nature  of  the  reconstruction 
process.  The  problem  may  be  restated  thus:  for  any  point 
in  the  image,  what  points  from  each  projection  must  be  used 
to  get  (interpolate)  that  projection's  contribution  to  the 
final  value?  Since  the  projections  are  at  various  orientations, 
this  becomes  a  geometric  operation  problem  which,  except  for 


the  two-projection  situation,  is  of  a  form  that  the  fixed 
geometry  of  the  MPP  cannot  easily  handle- 


In  the  Fourier  reconstruction  method,  while  rows  of 
processors  may  be  able  to  transform  the  projections,  and 
the  projections,  once  in  place  among  the  appropriate  pro¬ 
cessors,  may  be  fairly  readily  interpolated  (and  the  image 
inverse  transformed  by  the  method  in  Section  5) ,  it  is  not 
clear  how  to  smoothly  get  the  transformed  projection  points 
to  the  processors  where  they  belong. 

For  image  reconstruction  on  ZMOB,  there  is  an  attractive 
way  to  implement  the  filtered  back-projection  method.  Given 
P  processors  and  projections,  the  circular  image  is  parti¬ 
tioned  into  2P  sectors,  and  each  processor  is  assigned  two 
opposite  sectors,  such  that  each  projection  bisects  each  pair 

of  sectors.  For  an  N  by  N  image,  each  sector  pair  will 
.  ttN^  . 

contain  approximately  points  (about  50  for  a  32  by  32 

image  with  16  partitions) .  Each  processor  is  then  loaded  with 
the  projection  data  assigned  to  it.  Each  point  in  the  sectors 
will  add  to  a  running  sum,  as  the  back-projected  contribution 
from  that  projection,  an  interpolated  value  depending  on  where 
a  line  from  the  point,  normal  to  the  projection,  falls  on 
the  projection.  After  the  first  projection  is  processed, 
each  processor  passes  those  values  to  its  next  neighbor,  then 
again  to  the  neighbor  two  over,  and  so  on  (note  that  in  later 
rounds,  the  normal  each  point  drops  onto  the  projection  takes 
into  account  the  ray  number  it  is  working  on) . 


To  calculate  the  computational,  communication,  and 
space  complexity  of  this  algorithm,  define  the  following 


variables : 


N  =  image  diameter  (N  by  N  image)  (and 
projection  length) 

P  =  number  of  processors  (and  projections) 

p  =  time  to  pass  one  point  between  processors 

C.  .  =  time  to  process  one  image  point  (interpolate 
and  sum) 

r  =  time  to  load  one  point  into  ZMOB  via  the 
UNIBUS 


The  computation  time  is  the  time  for  each  point  in  one 
processor's  allocation  of  the  image  (2  sectors)  to  be 
processed,  for  each  projection: 


comp 


j 


^int 


The  communication  time  will  consist  of  two  parts :  the 
time  to  pass  projections  between  processors,  and  the  time  to 
load  the  projection  data  (via  the  UNIBUS,  as  shown  earlier  to 
be  the  rate-determining  step).  Thus, 

=  PN„  +  2rN^ 
comm  p 

Finally,  the  amount  of  memory  required  is  that  for  the 
projection  and  the  portion  of  the  image; 


Memory  size 


”[i)' 

=  1^  +  N 


To  find  how  well  this  algorithm  compares  to  commercial 


algorithm  timings  (around  10  sec.)»  using  the  following 
representative  values: 

N  =  512  (512  by  512  image,  at  1mm  resolution) 

P  =  256 

p  =  10  ^sec.  (1 Op  sec. /byte  ZMOB  transfer  rate) 

_7 

r  =  4x10  sec.  (400nsec./byte  UNIBUS  transfer  rate) 


we  get : 


T  _  =  1.31  +  0.210  =  1.52sec. 

comm 

T  =  205776C.  ^ 
comp  int 

and  for:  =  (Ipsec.,  lOpsec.,  lOOpsec.) 


we  get; 


comp  =  (0.206sec.,  2.06sec.,  20.6sec.) 


for  a  total  time  of:  (1.73sec.,  3.58sec.,  22.1sec.) 

For  the  range  of  values  used,  which  should  be  realistic 

since  many  of  the  values  used  in  projection  normal  computation 
and  interpolation  may  be  precomputed  instead  of  computed 
"on-the-fly" ,  the  timings  for  ZMOB  image  reconstruction 
should  be  very  attractive  compared  to  commercial  systems. 
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TABLE  3:  MPP  SUMMARY  (MEASURES  PER  BIT) 
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TABLE  4:  ZMOB  SUMMARY  (MEASURES  PER  PIXEL) 
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TABLE  6;  ZMOB  histogranmiing  complexity: 
Computation 

Communication 

Memory 
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Figure  1.  MPP  passing  sequences  for  various 
types  of  neighborhoods 
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In  entry  a/b,  a  =  value  passing  through, 
b  =  counter  contents.  The  values  are 
cyclically  shifted  upward.  Each  counter 
adds  1  when  the  value  passing  through 
it  is  equal  to  its  row  number.  In  this 
example,  there  are  8  PEs  and  8  gray  levels. 


Figure  2.  Column  histogramming 
example 
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In  each  entry,  bits  that  have  just  been 
passed  are  underlined;  primes  denote 
positions  of  carry  bits 


Figure  3.  Row  totalling  example 
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Pattern  rec 
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The  Mas^vely  Parallel  1  roce^or  128  by  128  array  of 

processiry  elements  that  ci  mmuni^b^^^th  their  horizontal  and 
vertical^eighfepxA-.Jsy  Shi^'  ing  dam  one  bit  at  a  time.  This 
paper  ^toettsSes  the  efficii  nt  use  m  MPP  for  various  types  of  image 
proceM^ng  operations,  inc  uding  po^t  and  local  operations,  dis- 
crete^ransforms ,  and  compi  tation  of^mage  statistics.  A  compar- 
isoi^3?bv§en  MPP  and  ZMOB  a  system  consisting  of  256  micro- 
pro^ssorsT  is.  ..also  presen' ed.  \ 
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